Telework "avatar work," in which people with disabilities can engage in physical work such as customer service, is being implemented in society. In order to enable avatar work in a variety of occupations, we propose a mobile sales system using a mobile frozen drink machine and an avatar robot "OriHime", focusing on mobile customer service like peddling. The effect of the peddling by the system on the customers are examined based on the results of video annotation.
To realize continuous frailty care in the daily lives of older adults, we propose AHOBO, a frailty care robot for older adults at home. Two types of support systems are implemented with AHOBO to support older adults in both the physical and psychological aspects of frailty. For physical frailty care, we focus on blood pressure and developed a support system for blood pressure measurement with AHOBO. For psychological frailty care, we implemented coloring recreation with AHOBO as a recreational activity with the robot. The usability of the systems was evaluated under the assumption of continuous use in daily life. For the blood pressure measurement support system, we conducted a questionnaire-based qualitative evaluation with 16 subjects, including older adults who measured their blood pressure with the system. The results confirmed that the proposed robot does not affect blood pressure readings and is acceptable in terms of ease of use based on subjective evaluation. For the coloring recreation interaction, a subjective evaluation under a verbal fluency task was conducted with two older adults, and it was confirmed that the interaction can be used continuously in daily life. Widespread use of the proposed robot as an interface for AI that supports daily life will lead to a society in which AI robots support people from cradle to grave.
Video understanding is a growing field and a subject of intense research, comprising many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, which is difficult due to the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observe, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. It is therefore crucial to design a contextual, explainable video representation extraction that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human-perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
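To make the three-factor idea concrete, the following is a minimal illustrative sketch (not the paper's code) in which a global environment token attends over detected actor and object tokens via cross-attention; all module names and dimensions are assumptions.

```python
# Illustrative sketch (not the authors' code): fusing actor, object, and
# environment features into one contextual clip representation with
# cross-attention. Feature dimensions and module names are assumptions.
import torch
import torch.nn as nn

class ContextualFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Environment queries attend over actor and object tokens.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, env_feat, actor_feats, object_feats):
        # env_feat: (B, 1, D) global environment token per clip
        # actor_feats: (B, Na, D), object_feats: (B, No, D)
        context = torch.cat([actor_feats, object_feats], dim=1)  # (B, Na+No, D)
        fused, _ = self.attn(env_feat, context, context)         # (B, 1, D)
        return self.proj(fused + env_feat)                       # residual fusion

# Example: one clip, 3 detected actors, 5 detected objects, 512-d features.
fusion = ContextualFusion()
out = fusion(torch.randn(1, 1, 512), torch.randn(1, 3, 512), torch.randn(1, 5, 512))
print(out.shape)  # torch.Size([1, 1, 512])
```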
Video anomaly detection (VAD) -- commonly formulated as a weakly supervised multiple-instance learning problem because frame-level labeling is labor-intensive -- is a challenging problem in video surveillance, where anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize ViT-encoded visual features from CLIP, in contrast to the conventional C3D or I3D features used in the domain, to efficiently extract discriminative representations. We then model long- and short-range temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study conducted on each component confirms its effectiveness, and extensive experiments show that our proposed CLIP-TSA outperforms existing state-of-the-art (SOTA) methods by a large margin on two commonly used VAD benchmark datasets (UCF-Crime and ShanghaiTech Campus). The source code will be made publicly available upon acceptance.
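As a rough illustration of the overall recipe (CLIP snippet features scored with temporal self-attention and trained with a weakly supervised MIL objective), the sketch below is a simplified stand-in, not the released CLIP-TSA implementation; the hyperparameters are assumptions.

```python
# Minimal sketch of the general idea: score snippet features with temporal
# self-attention and train with a top-k MIL ranking objective. Not the
# released CLIP-TSA code; dimensions and k are assumptions.
import torch
import torch.nn as nn

class TemporalSelfAttentionScorer(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, snippets):            # snippets: (B, T, D) CLIP features
        ctx, _ = self.attn(snippets, snippets, snippets)
        return self.score(ctx).squeeze(-1)  # (B, T) anomaly score per snippet

def mil_ranking_loss(scores_abnormal, scores_normal, k=3):
    # Abnormal bags should contain higher-scoring snippets than normal bags.
    top_ab = scores_abnormal.topk(k, dim=1).values.mean(dim=1)
    top_no = scores_normal.topk(k, dim=1).values.mean(dim=1)
    return torch.clamp(1.0 - top_ab + top_no, min=0).mean()

scorer = TemporalSelfAttentionScorer()
ab, no = torch.randn(2, 32, 512), torch.randn(2, 32, 512)
loss = mil_ranking_loss(scorer(ab), scorer(no))
```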
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video containing several temporal event locations as a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event content within a video. Finally, we present a new VL contrastive loss function to ensure that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in accuracy and diversity.
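The contrastive objective can be pictured with a hedged sketch like the symmetric InfoNCE loss below, computed over paired event and caption embeddings; the actual VLTinT loss may differ in its details.

```python
# A sketch of a visual-linguistic contrastive objective in the spirit
# described above (align event embeddings with caption embeddings).
import torch
import torch.nn.functional as F

def vl_contrastive_loss(event_emb, caption_emb, temperature=0.07):
    # event_emb, caption_emb: (N, D) paired embeddings for N events.
    v = F.normalize(event_emb, dim=-1)
    t = F.normalize(caption_emb, dim=-1)
    logits = v @ t.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: match each event to its caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = vl_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```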
This paper proposes a neural architecture search (NAS) method for split computing. Split computing is an emerging machine-learning inference technique that addresses the privacy and latency challenges of deploying deep learning in IoT systems. In split computing, a neural network model is separated and cooperatively processed between an edge server and an IoT device over the network. Thus, the architecture of the neural network model significantly affects the communication payload size, model accuracy, and computational load. In this paper, we address the challenge of optimizing the neural network architecture for split computing. To this end, we propose NASC, which jointly explores the optimal model architecture and a split point to satisfy a latency requirement (i.e., the total latency of computation and communication is smaller than a certain threshold). NASC employs a one-shot NAS that does not require repeated model training, enabling a computationally efficient architecture search. Our performance evaluation using hardware (HW)-NAS-Bench benchmark data shows that the proposed NASC can improve the trade-off between communication latency and model accuracy, i.e., reduce latency by about 40-60% from the baseline with slight accuracy degradation.
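The latency constraint at the heart of the joint search can be illustrated with the toy sketch below, where a candidate pairs an architecture with a split point; the `Candidate` fields and the numbers are made up for illustration, and the one-shot supernet that predicts accuracy is omitted.

```python
# Toy sketch of the latency constraint used when jointly picking an
# architecture and a split point: device-side compute + transmission of the
# intermediate feature payload + server-side compute must stay under a budget.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    device_ms: float    # latency of layers before the split, on the IoT device
    payload_kb: float   # size of the intermediate tensor sent over the network
    server_ms: float    # latency of layers after the split, on the edge server
    accuracy: float     # predicted accuracy (e.g., from a one-shot supernet)

def total_latency_ms(c: Candidate, uplink_kbps: float) -> float:
    return c.device_ms + (c.payload_kb * 8.0 / uplink_kbps) * 1000.0 + c.server_ms

def best_under_budget(cands, budget_ms=100.0, uplink_kbps=10_000):
    feasible = [c for c in cands if total_latency_ms(c, uplink_kbps) <= budget_ms]
    return max(feasible, key=lambda c: c.accuracy, default=None)

cands = [
    Candidate("split@block2", 12.0, 256.0, 30.0, 0.71),
    Candidate("split@block4", 35.0, 64.0, 18.0, 0.74),
]
print(best_under_budget(cands))
```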
Automated video-based assessment of surgical skills is a promising task for assisting young surgical trainees, especially in resource-poor areas. Existing works often resort to a joint CNN-LSTM framework, in which an LSTM models long-term relationships over spatially pooled short-term CNN features. However, this practice inevitably neglects the differences among semantic concepts such as tools, tissues, and background in the spatial dimension, impeding the subsequent temporal relationship modeling. In this paper, we propose a novel skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions. The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions. It also enables us to further incorporate auxiliary information, such as kinematic data, to improve representation and performance. Experiments on two datasets show the competitiveness of ViSA compared with state-of-the-art methods. The source code is available at: bit.ly/miccai2022visa.
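The semantic-grouping step can be pictured with the simplified sketch below (not the released ViSA code), where spatial features are softly assigned to a few learned prototypes, e.g., tools, tissue, and background, and aggregated per group and frame.

```python
# Rough sketch of the semantic-grouping idea: soft assignment of spatial
# features to K learned semantic prototypes, then per-group aggregation.
# Dimensions and the number of groups are assumptions.
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    def __init__(self, dim=256, num_groups=3):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_groups, dim))

    def forward(self, feats):                       # feats: (T, HW, D)
        assign = torch.softmax(feats @ self.prototypes.t(), dim=-1)  # (T, HW, K)
        # Weighted average of spatial features per group and frame.
        grouped = torch.einsum('thk,thd->tkd', assign, feats)
        grouped = grouped / assign.sum(dim=1).unsqueeze(-1).clamp(min=1e-6)
        return grouped                              # (T, K, D) per-group features

agg = SemanticAggregation()
out = agg(torch.randn(16, 14 * 14, 256))
print(out.shape)  # torch.Size([16, 3, 256])
```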
We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA asks about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges the static, conventional assumptions of QA datasets and pursues instantaneous applications. We build strong baseline models on top of large language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results from the past month. Our experimental results show that GPT-3 can often properly update its generation results based on newly retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when the retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user, or even with the retrieval module, to modify the retrieval results? We hope that RealTime QA will spur instantaneous applications of question answering and beyond.
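The retrieve-then-answer loop that the benchmark probes, together with the abstention question raised above, might look like the hypothetical sketch below; `search_recent_news` and `generate_answer` are placeholders rather than a real API, and the abstention rule is a deliberate simplification.

```python
# Hypothetical retrieval-augmented answering loop of the kind the benchmark
# evaluates; the callables passed in are placeholders, not a real API.
def answer_realtime_question(question, search_recent_news, generate_answer,
                             min_docs=1):
    docs = search_recent_news(question)          # up-to-date retrieval
    if len(docs) < min_docs:
        # Communicate unanswerability instead of falling back to stale
        # parametric knowledge.
        return "I don't know (insufficient recent evidence)."
    context = "\n".join(d["text"] for d in docs)
    return generate_answer(question=question, context=context)
```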
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similarly photorealistic results in combination with scene completion, where spatial 3D scene understanding is essential. To this end, we propose a generative pipeline that operates on a grid-based neural scene representation, completing unobserved scene parts via a learned distribution of scenes in a 2.5D-3D-2.5D manner. We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to infer the missing regions. Photorealistic image sequences can finally be obtained via consistency-aware differentiable rendering. Comprehensive experiments show that the graphical outputs of our method outperform the state of the art, especially in unobserved scene parts.
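A high-level skeleton of such a 2.5D-3D-2.5D pipeline is sketched below; the layer sizes are placeholders and the differentiable rendering step is only indicated by a comment, so this is not the paper's actual architecture.

```python
# High-level skeleton of a 2.5D-3D-2.5D completion pipeline: a geometry
# completion network followed by a texture network over a lifted feature
# grid. All layers and sizes are placeholders.
import torch
import torch.nn as nn

class SceneCompletionPipeline(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        # Operates on a feature grid lifted from encoded RGB-D images.
        self.geometry_net = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1))
        self.texture_net = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, 3, 3, padding=1))

    def forward(self, feature_grid):                  # (B, C, D, H, W)
        completed = self.geometry_net(feature_grid)   # fill unobserved geometry
        colors = self.texture_net(completed)          # inpaint appearance
        # A differentiable renderer would project `colors` to target views here.
        return colors

pipe = SceneCompletionPipeline()
print(pipe(torch.randn(1, 32, 16, 64, 64)).shape)  # torch.Size([1, 3, 16, 64, 64])
```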
Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into a compound prototype consisting of a group of global prototypes and a group of focused prototypes, and then compares video similarity based on these prototypes. Each global prototype is encouraged to summarize a specific aspect of the entire video, for example, the start or evolution of the action. Since no explicit annotation is provided for the global prototypes, we use the group of focused prototypes to attend to certain timestamps in the video. We compare video similarity by matching the compound prototypes of the support and query videos; for example, videos are compared from the same perspective to check whether two actions start in the same way. For the focused prototypes, since actions exhibit various temporal variations across videos, we apply bipartite matching to allow the comparison of actions with different temporal positions and offsets. Experiments show that our proposed method achieves state-of-the-art results on multiple benchmarks.
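The bipartite matching of focused prototypes can be illustrated with the Hungarian-algorithm sketch below; prototype extraction itself is omitted and all shapes are assumptions.

```python
# Illustrative sketch of the matching step between support and query
# prototypes: Hungarian matching for focused prototypes, position-wise
# comparison for global prototypes. Not the authors' implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def focused_prototype_similarity(support_protos, query_protos):
    # support_protos, query_protos: (K, D) L2-normalized focused prototypes.
    sim = support_protos @ query_protos.T            # (K, K) cosine similarities
    row, col = linear_sum_assignment(-sim)           # maximize total similarity
    return sim[row, col].mean()

def global_prototype_similarity(support_protos, query_protos):
    # Global prototypes are compared position-wise (same aspect to same aspect).
    return np.mean(np.sum(support_protos * query_protos, axis=1))

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 64)); s /= np.linalg.norm(s, axis=1, keepdims=True)
q = rng.normal(size=(8, 64)); q /= np.linalg.norm(q, axis=1, keepdims=True)
score = 0.5 * (focused_prototype_similarity(s, q) + global_prototype_similarity(s, q))
```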